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Abstract 

We generalize the approach of iLiu and Lawrencd (|l999l ) for multiple 
changepoint problems where the number of changepoints is unknown. The 
approach is based on dynamic programming recursion for efficient calcu- 
lation of the marginal probability of the data with the hidden parameters 
integrated out. For the estimation of the hyperparameters, we propose 
to use Monte Carlo EM when training data are available. We argue that 
there is some advantages of using samples from the posterior which takes 
into account the uncertainty of the changepoints, compared to the tradi- 
tional MAP estimator, which is also more expensive to compute in this 
context. The samples from the posterior obtained by our algorithm are in- 
dependent, getting rid of the convergence issue associated with the MCMC 
approach. We illustrate our approach on limited simulations and some real 
data set. 

Keywords: Empirical Bayes, Forward-backward algorithm. Hierarchical Bayesian 
model, Monte Carlo EM. 

1 Introduction 

Change point models (CPM) have been one of the main research topics in statis- 
tics for many years. Inference on CPM is a difficult problem because it is an 
irregular model in the sense that the number of parameters in the model is 
not a priori fixed, thus traditional likelihood theory does not directly apply. In 
this paper, we aim to develop a Bayesian approach to CPM and investigate its 
computational properties. 

Traditional approaches to CPM are frequentist in nature. Difficulties asso- 
ciated with the frequentist approach include inferences on the change points as 
well as inferences on the parameters within each single segment. The difficul- 
ties result from the fact that the number of parameters in the model increases 
with the number of changepoints. The frequentist approach to inferences on the 
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number of changepoints consists mainly of adapting hypothesis testing frame- 
work for testing i changepoints against i + 1 changepoints. This approach is 
not only difficult to analyze but is inelegant mathematically as well. A large 
portion of the literature also only focuses on binary segmentation, that is, con- 
si ders only the case of at mos t one changepoint. The Bayesian approach is used 
in Rafterv and Akman ( 19861 ) , but it is limited to the case of o nly one change- 



point, and iterative binary segmentation approach is used in lYang and Kuo 
( 20011 ). where Bayes factor is used to decide whether to contin ue segmentin g 
the subsequence. One approach conceptually similar to ours is iGreen (|l995l) . 
where the author used reversible-jump MCMC for inferences on the segmenta- 
tion. That paper also demonstrated clearly that under the Bayesian framework, 
inferences are automatic once the model is specified. Maybe surprisingly, the 
posterior for Bayesian CPM can b e computed in closed form with polynomial 
time complexity as demonstrated in Liu and Lawrenc^ ( 1999( ) for DNA sequence 
segmentation. This approach can be generalized to continuously distributed ob- 
servations (as shown in this paper) by specifying conjugate priors. This model 
is much more flexible than previous approaches in that we can easily switch 
to other observational distributions, consider correlations among observations, 
and use higher-order polynomial model within each segment. We acknowledge 
that other approaches might also be extensible to these situations with further 
work, but we think the hierarchical Bayesian approach is simpler for adapting 
to different situations once the basic framework is laid out and conceptually 
more elegant in dealing with different applications. The estimation and infer- 
ential procedures can be made to be fully automatic with little or no human 
intervention. 

In this paper, we focus on the piece-wise constant model with observations 
following a normal distribution. It is straightforward to extend the model to 
other observational distribution, which we do not pursue here for both simplicity 
and specificity. Our Bayesian model can be described as follows. First, we have 
a prior for the segmentation together with the means and variances for each 
segment. These hidden parameters (means and variances for each segment) 
are drawn independently from a conjugate prior. Finally, the observations are 
generated independently from the normal distribution for each segment. 

Section 2 sets up the Bayesian model, details the recursion used for posterior 
computation and specification of the hyperprior, deals with missing data and 
some numerical issues. Section 3 demonstrates the effectiveness of our model 
with simulation and application. We end with a discussion in Section 4. 



2 Hierarchical Bayesian model 
2.1 Specification of the model 

We make use of a hierarchical model that uses a continuous mixture of nor- 
mals to model the values within each segment. That is, we put a prior on 
the mean and variance of the observations for the segments. It turns out that 
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the hidden parameters consisting of means and variances of our observations 
can be integrated out analyticaUy, and so can the location of the changepoints. 
This observation imphes our model has a computational complexity which is 

quadratic in the length of the sequence. 

Our model is based on the model in Liu and Lawrenc^ ( 1999f ). The modifi- 
cations and extensions we made to that model include the following: 

1. We extend the model to the case where observations are modeled as 
continuous random variables. This change essentially imposes no extra technical 
difficulty, if conjugate priors are used as in the discrete model, which makes 
explicit in tegration possible. 

2. In iLiu and Lawrenc3 ( 19991 ). the hyperparameters are assumed to be 
given, which may be unsatisfactory from a statistical point of view. We propose 
a principled method to estimate these hyperparameters within the empirical 
Bayes (EB) paradigm combined with a stochastic version of the EM algorithm, 
when there are multiple sequences so that estimation is feasible. We show that 
an essential step required by the EM algorithm is made possible by utilizing 
the forward recursion and backward sampling approach. We also propose a 
simple procedure for the automatic choice of the hyperparameters when only 
one sequence is available. 

3. Data with incomplete observations are frequently encountered in practice. 
The hidden variables, which include means and variances for each segment and 
location of the changepoints, can also be considered as missing data. In fact, 
a large amount of work has been done to address the difficulty in inference 
arising from the existence of hidden variables, which are regarded as missing 
data, including the EM algorithm which is specifically designed for this purpose. 
In this work, we use missing data to refer to the case that some regions of 
the sequence are totally unobserved, producing gaps in our data. Many well- 
known algorithms like imputation can be used to solve this problem. We show 
that within our framework, the inferences can proceed without recourse to the 
imputation procedure. This is done by using the most common trick of Bayesian 
inference - integrating as much as you can to get rid of all variables that one 
considers as nuisances. In contrast, the most frequently adopted strategy is to 
just ignore the missing observations. 

Specifically, the model can be described as follows. For a sequence y = 
{yi, 2/2, . ■ • , Un}, suppose the maximum number of changepoints is kmax, which 
is specified beforehand. A particular segmentation can be denoted by ^ = 
{ci, C2, . . . , Cfc}, 1 < ci < C2 < • • • < Cfe = n, where n is the length of the 
sequence, and Ci is the ith changepoint. For purpose of concreteness, with two 
neighboring segments, the last observation of the first segment will be considered 
as the changepoint between the two. Note also that we set = n, so the number 
of changepoints is the same as the number of segments with our notation. 

The recursion marginalizing over changepoints w hich we will presen t in th e 
following is an extension of the recursion used by [LhT and Lawrence! 
Marginalization over segmental means and variances is achieved by integrating 
over segmental means and variances in the hierarchical models. These marginal- 
ization steps greatly reduce dimensionality of the space of the unknowns. 
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We put a uniform prior on the number of changpoints and assume all seg- 
mentations with exactly k changepoints are equally likely. That is, 

P{k) ~ ^/kmaxj 

and p{A\k) = l/^^ if A has A: changepoints. (1) 

Given the segmentation A, the mean and variance for each segment is gen- 
erated from normal-inverse-x^ distribution 

2 

a^\A^ Inv-x{i^o,crl), t = l,2,...,k (2) 

Those parameters with subscript are hyperparameters that need to be specified 
or estimated from the data. We will refer to those /i^ and erf above as hidden 
parameters, and the segmentation A itself as the hidden variable. 

In our model, each segment has a different mean and variance associated with 
it, and these means and variances are independently drawn from the hyperprior. 
With segmentation A and fj,i,ai given for each segment, the observations are 
naturally modeled as normal with given mean and variance: 

where we used the notation y^^j = j7,;+i ^^- i ; TJ-i^- Note ([2]) is exactly the 

same conjugate prior that is used in iCelman et all (|l995h which makes analytic 



integration possible. 

In summary, our Bayesian model can be described as follows. First, we have 
one prior on the hidden parameters - the mean and variance for each segment. 
We also have a simple prior on the possible segmentations. Hidden parameters 
for each segment are drawn independently from the hyperprior conditioned on 
the segmentation. Finally, the observations are generated independently from 
the normal distribution for each segment. 



2.2 Recursion and inferences 

The hyperparameters O = {/io, ^o, t'Oi fol ^-re assumed for now to be known. 
Then the main goal of inference is to get at the posterior distribution p{A\y). In 
the following, we omit explicitly writing down the dependence of the probability 
distribution on the hyperparameters Q. We show the difficulty of computing 
the posterior distribution next and propose a dynamic programming recursion 
for solving it. 

By the Bayes formula, we have p{A\y) — p{y\A)p{A) / J2a' P(y\^')Pi^') ' 
where p(y\A) is the likelihood in which the hidden parameters are integrated 
out. By the independence oiyt in different segments conditioned on the hidden 
parameters, we have 

1=0 
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where \A\ is the number of segments in the segmentation A and cq = by 
convention. Each term in the product can be analytically evaluated thanks to 
the conjugate prior used: 



PiVa-.b) = 



1 EilBi-^-)^ 1 _fco(M-™£ 



i^a-a-(^+i)e-^d;.da^ (3) 

The part that gives us trouble in computing p(A\y) is the summation over all 
different segmentations. Evaluation of this sum in a brute-force manner is com- 
putationally prohibitive since the number of possible segmentations increases 
with n like n^^'"' . This is computationally intensive with even a sequence of 
reasonable length n with typical choice of kmax- This sum is sometimes called 
partition function which is the normalizing constant in general graphical models. 
Approximate computation of the sum can be achieved by sampling schemes that 
do not require knowledge of the normalizing constant. Importance sampling (IS) 
or Markov Chain Monte Carlo (MCMC) can be used, but it is difficult to come 
up with proposal distributions that is efficient enough for our purpose due to 
the high dimensionality o f A. Fortunately, a dyna mic programming recursion 
similar to the one used in lLiu and Lawrence! ( 19991 ) can reduce the complexity 
to 0{n^kmax) as we explain in the following. 

We denote by p{j : i, k) the prior probability that yj-i consists of k segments 
with j the first observation of the 1st segment, and i the last observation of the 
fcth segment. For simplicity, p(l :i,fc) is also written as A:). Letp(j, /c— l|z, fc) 
be the conditional probability that the previous changepoint is at position j 
given the fcth changepoint is at position i. Similarly, p{yi:j\i : fc) denotes the 
probability conditioned on the event that the subsequence from i to j has k 
segments, also abbreviated as p{yi;j\k). Then we have 



pr{yi.i\l : i has k changepoints (fc segments) ) 
= y~^pr(j/i:i, last changepoint before i is at j\i, fc) 

= ^PU,k- l\i,k)p{yi..j\k - l)p(?/j+i:,|l) 

j<i \k~l) 

(4) 

where on the third line above we used the independence of observations for 
different segments given the segmentation. All of the probabilities above should 
be understood as being conditioned on the hyperparameters, which we omit for 
simplicity in notation. 
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The base case when /c = 1 for the above recursion can be obtained by 
integrating out the hidden variables: 

P(yi:j|l) = j P(2/j:jlM,0-^)p(M,<7^l/«0,fco,l'0,Cro)rf/id<T^ 

The integration can be done analytically since we used the usual conjugate prior 
in the model. 

After p{y\k) is computed by the recursion, we can compute the inverted 
probabiHty p{k\y) using Bayes rule: 

p{k\yi:n) cx p{k)p{yi.,n\k) 

Almost all of the desired probabilities of interest can be easily obtained. For 
example, the marginal likelihood is just 

kmax 

PiVl-.n) = ^ Pik)p{yi:n\k) 

k=l 

We can also compute the marginal probability that a changepoint will occur 
atj, (1 <j< n): 

p{ck = j for some k\y) 

= \ P{j,k)p{n- j,K- k)p{yi..j\k)p{yj+i..n\K- k) (5) 

Pm-.n) l<k<K<k^a. 

Using the marginal likelihood, we can also compute the probability of a specific 
segmentation by p{A\y) = p{A)p{y\A)/p{y). Note here we only computed the 
probability of a given segmentation, which does not give us a sense of which A 
has high probability. The problem is the same as before — we simply have too 
many possible segmentations and it is not feasible to compute the probabilities 
for all of them. 

However, we can obtain exact and independent samples from the posterior 
distribution of the segmentations given the data. First we draw the number 
of segments k from the posterior p{k\y), which we have derived above, and 
set Ck = n. Then we can recursively sample backwards from the following 
distribution: 

P{ck-1 = j\yi:n,Ck = i) 

oc piyi:j\k-l)-p{yj+i:i\l)p{j,k-l\i,k) (6) 

The above derivation used Bayes rule, that the conditional probability is pro- 
portional to the joint probability. 

This forward recursion backward sampling procedure is reminiscent of the 
forward-backward algorithm for hidden Markov model (HMM), in which the 
forward step computes the probability of past observations summing over all 
previous states, and the backward step computes the probability of future ob- 
servations given the current state. This is also similar to the Viterbi algorithm, 
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where the forward step is used to find the optimal value of the objective func- 
tion, keeping track of the previous optimal state, and then backtracing is used 
to find the optimal states that lead to the optimal value. 

If a single estimator is desired, the most frequently used one is the maximum 
a posteriori estimate (MAP). Using dynamic programming, MAP estimate can 
be obtained by recursion. We let V^(z, fc) — maxp(ci, c^—i jc^ — ^^yiin 

), the 

maximum probability of the configurations that can be obtained for previous 
changepoints conditioned on the fcth changepoint. Given the location j of the 
(fc — l)th changepoint, V{i, k) is independent of the information contained in 
the segments 1, . . . ,k — 2, since the maximum probability configuration up to 
position j is summarized by V{j, k — I). So we have the following recursion: 

V{i,l) = 1 

V{i,k) = ma.xp{ck-i = j\ck = i,y) -Vij^k - I), k > 1 (7) 
(l){i,k) = argmaxp(cfe_i = j|cfc = i,?;) • fc - 1), /c > 1 

3 

and the optimal segmentation can be found by backtracing. First the optimal 
number of changepoints is determined by 

k — argmax V^(n, k) ■ p(fc|yi:„) 
Setting Cfc = n, we recursively find the previous changepoint: 

Cfc-i = (t>{ck,k) 

Note that the probability p{ck~i = j\ck = i,y) used in the recursion ([7]) is 
exactly what we used when sampling back. 

Even if k^ax is considered to be a fixed constant, we need to compute 
p{ck~i = j\ck = i,y) for i = 1, . . . ,n, j < i. So maximum a posteriori es- 
timate takes another 0{n^) computation time after the forward recursion. A 
simpler method to get at the posterior distribution is to make use of the sam- 
ples drawn. Drawing one sample only takes time 0{n) as can be seen from the 
derivation © above. Ding et al. ( 20051 ) argues forcefully about the advantage of 



using samples to characterize the posterior distribution for their RNA structure 
prediction problem. 

The samples can also be used to approximate the marginal probability of 
changepoint locations in equation ^ . We can simply count the fraction of sam- 
ples which has a changepoint at certain positions. In equation 0, p{yj-^-i;n\n—k) 
is not directly available after forward recursion, which only computed p{yi:i\k) 
for the subsequences starting from index 1, so another "backward recursion" 
will be required to obtain this probability. Thus the sampling approach is much 
easier to implement. One advantage of using samples is that we immediately 
get a sense of the uncertainty in segmentation by examining different samples 
from the posterior. 

The hidden variables, mean and variance for each segment, can also be es- 
timated. Given a sampled segmentation, the hidden parameters can be esti- 
mated for each segment independently. Given one segment with length I, the 
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prior on the hidden parameters is the normal-inverse-x^ distribution. The pos- 
terior distribution of the mean and variance is well known, which is in the same 
normal- inverse-x^ family with updated parameters ( Gelman et al. ( 19951 )): 



A* = 



ko + I 
ko + I 
vq + I 



+ 



ko + l 



y 



k = 



(8) 



.2 



ko + 1 



{y - Mo)^ 



Fix a position i in the sequence. After N samples are drawn, the posterior 
for the hidden parameters at position i is a mixture of N normal-inverse-x^ 
distributions, with each component corresponding to one sample: 



where fJ-^"\ k^^\ are the values computed as in ^ using the segment 

that contains i in the n-th sample. We can use the mean of this mixture dis- 
tribution as a summary statistics computed from multiple samples. Figure [T] 
shows the mean of fii, which looks like a smoothed version of the original data. 
For this visually complex sequence, changepoint model is best seen as a tool 
for function approximation or denoising. One possible usage of these estimated 
hidden parameters is to check model fit by computing the standardized residuals 
at each position. 

This simple model can be extended somewhat by taking into account the 
length constraint. The constraint on the segment length can be represented as 
an interval [Z, m], where / > 1 is the lower bound and m < n is the upper bound 
for the length of a segment. These bounds might come from expert opinion. 
Putting a lower bound might also make the segmentation result more robust to 
outliers. 

With length constraints on the segments, the simple combinatorial counting 
can no longer be used. A recursion is required to count the number of possible 
segmentations. Let S{i, k) be the number of possible segmentations of a se- 
quence of length i, with k changepoints (fc segments). S{i, k) can be computed 
using the following recursion: 



1 



N 



) 




S{i,k) 



Y,S{j,k-l)S{i-j,l) 



The main recursion now becomes: 



pr(yi:i|l : i has k changepoints {k segments)) 
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Figure 1: A data sequence (dotted) with the posterior mean of the hidden 
parameter /j, (sohd) computed from 500 samples 

= ^P{j,k- l\i,k)p{yi.,j\k- l)p{yj+i.,i\l) 
j<i 

Sij,k- l)S{i-j,l) 

Note that we can define p{yi:i\k) :— k)p{yi-i\k) (p is not probabihty), so the 
recursion can be done in terms of p, which becomes simply 

P{yi:i\k) = ^p{yi:j\k - l)p{yj+i.,i\l) 
j 

Actually, the recursion ([4]) can also be transformed into a simpler form by 
defining p appropriately. With this definition of p, other formulae also become 
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simpler. For example, the recursive backward sampling becomes 



P{Ck~l = j\yi:n,Ck = i) OC p{yi:j\k ~ 1) ■ p{yj + l.,i\l) 

2.3 Estimation of the hyperparameters 

We have up to now assumed that the hyperparameters O are fixed in advance. 
In some simple cases, those hyperparameters can be specified beforehand by 
an expert who has some idea about the magnitude of these parameters due to 
previous experience. Although this sounds like a simple task, prior elicitation 
is the most important and time consuming part of Bayesian analysis. Good 
priors are usually difficult to obtain, and the poorly chosen prior will usually 
bias the inference, although this practical issue does not seem to concern theo- 
retical statisticians. Empirical Bayes (EB) is a principled method to estimate 
hyperparameters in a Bayesian model, and we will discuss it in this subsection. 

For our hierarchical Bayesian model, the marginal probability of the sequence 
is p{y\Q), after summing over segmentations and integrating out hidden param- 
eters. This marginal probability is computed by forward recursion in section 
12. 2[ which takes 0{n^) time. Although dynamic programming recursion has 
provided an efficient way to compute this for fixed hyperparameters, it does not 
solve our problem. If we use some numerical software package to optimize this 
objective function, a large number of evaluations are probably required to find 
the optimum, which is too costly in terms of computation time. Also, many effi- 
cient numerical optimization algorithms require the gradient as an input, which 
is also difficult to calculate for our problem. In the following, we will see an 
approach that makes the parameter estimation problem possible. 

As discussed above, we cannot directly perform maximization of the marginal 
likelihood to get an estimate for the hyperparameters since for each fix set of 
hyperparameters, evaluating the likelihood would require performing the recur- 
sion from scratch, which is too expensive since the optimization procedure will 
require many such evaluations. A natural approach to avoid this is to use the 
expectation-maximization (EM) algorithm and regard the segmentation as the 
missing variables. The EM algorithm is an efficient method dealing with missing 
data problems. In the E-step, it computes the sufficient statistics of A in the 
complete model \ogp{A,y\Q) under the distribution 9'-°'''^), where 

is the hyperparameters obtained from the previous iteration. In the M-step, 
we maximize over Q the expectation of the log-transformed complete likelihood 
conditioned on the old hyperparameters to obtain the new hyperparameters: 

Q(new) ^ argmax£;[logp(A,y|e)|y,e(°'''))] 

So the EM algorithm works by maximizing E[logp{A,y\Q)\y,Q^°'-'^'>)] with re- 
spect to Q, given the previous estimate 8*^°'''-', and then replace it in the next 
iteration with the new Q obtained by maximization in the previous step. The 
EM algorithm was shown to incr ease t he lik elihood in each iteration and will 
converge to the (local) maximum (IWul |l98l ). 
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The expectation E[logp{A, y\Q)\y, 8(°''^))] can be written out as 

^p(A|y,e(°'''))iogp(Ay|e) 

A 

Again, we are confronted with the problem of having to deal with the task of 
summing over all possible segmentations. To avoid this, we use a variant of EM 
called Monte Carlo EM (MCEM). The difference between MCEM and ordinary 
EM is that we use samples from the distribution p{A\y, to approximate 

the sum. That is, we replace the summation above with samples drawn from 
the posterior using hyperparameters obtained from the last iteration: 



1 ^ 

E[\ogpiA,y\e)\y,e^''"'^)] « _ ^ logp(A("), yje) 



where {A^"^}]^ are samples from O^"''''). But now, we can no longer 

guarantee that the algorithm will return a (local) maximum of the likelihood, 
or even that it will converge at all, although it generally works quite well in 
practice, and arbitrary accuracy can be obtained with a large number of samples. 
The sum J2n logp(A^"\ 2/1©) can be written as 

^iogp(A("),y|e) = ^iogb(^(")Mj;|A("),e)] 

n 

= ^ ^ logp(yj:j 1 1, 6) + constant 

" (jj)6A(") 

where the second sum is over all segments of A'"^. Notice in the computation of 
the above, we only need to evaluate the integration in ^ which has a closed-form 
formula, no recursions are required. Recursions are required for each iteration 
of the EM algorithm. Empirically, we observe 10 iterations seem to be sufficient 
for EM to reach convergence in our examples. 

The estimation of hyperparameters is feasible only when we have multiple 
sequences or a long sequence with enough number of segments. If this is not 
true, the following data dependent choice for the hyperparameters can be used 
and works well in our experience: 

fiQ : mean of the data 
fco : 0.01 

lyo ■■ 3 (9) 
(Tn '■ variance of the data 



These choices are the same as in iFralev and Rafterv ( 2007 ) with the same 



reasoning, except that we find empirically using variance of the data for ctq 
works better in our problem. 
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2.4 Missing observations 

Our model also yields a principled means to span the missing observations. 
Intuitively, when the gaps are small, we expect observations at the beginning 
and end of the gaps belong to the same segment, but this correlation across 
missing observations is unlikely to span long gaps. 

Let's suppose that we make no observations on a subset / C {i,i + 1, . . . , j} 
of a segment In the forward recursion, we need to compute the probability 
piUobsl^), where yobs is the observed values within which can be written 

as the integral over the missing observations 

p(yofcs|l) = J p{yi:j\l)dymi8 

We can simply ignore the missing observations in the computation of p{yi;j\l), 
the missing observation is taken into account only when we count the number of 
segmentations using the prior ([1]). This sounds more complicated than it really 
is. For example, the recursion and sampling can be done exactly as before by 
setting ^(yjiijall) = 1 if all values inside [ji, j2] are missing. 

We want to point out that in fact we made an important assumption which 
may not be obvious to the reader. Without any assumption, we have for all 
observed data yobs inside the ith segment 



p{yobs,r) = / piyobs, y^isllJ-i, )p{r\y^is, yobs)p{fJ-i, (t, |6)d/i<crj dy 



' ' 'mis 



where r is a vector containing binary variables indicating the positions of the 
missing observations. Because of the integration over ymis, this will potentially 
introduce complicated correlations between components oiyobs- So the integra- 
tion over and of generally cannot be computed. If we in addition assume 
that p(r|j/„ijs, yobs) = p{'^\yobs)^ that is all the information contained in r can be 
derived from the observed data, then it can be easily seen that integration over 
J/mis will not introduce any correlation between the observed data given /x^ and 
cr^, so p{yobs) can be computed by igno ring the missing data. T his assumption 



is called Missing At Random (MAR) ijLittle and RubinI (|2002l )). This is the 
working assumption under which our approach works. 

As a simple illustration, we simulated the observations as follows: 

y^^ N{-l,l),l<i<100 
y, ~ A^(-0.6,l),101 < i < 150 
y, - A^(l,l),151 < i < 250 

Note the changepoint between the first two segments are more difhcult to de- 
tect. Then we artificially insert some missing observations between the first two 
segments, so the observed data are: 

y, - A^(-l,l),l < i < 100 
y, ~ Af(-0.6,1),201 < i < 250 
y^ - A^(l,l),251 < i < 350 
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We generate 100 sequences of data from the first situation, and change the 
index of the observations to get 100 sequences with missing observations. We 
use the simple choice for the hyperparameters stated at the end of section 12.31 
For the first situation without missing observations, the algorithm detected the 
changepoint for 14 of the sequences (we decide that the changepoint is found 
if there is a changepoint within 3 observations of the true one) . While for the 
second situation, the algorithm detected the changepoint for 55 of the sequences. 
This simple simulation illustrated that sometimes whether we take into account 
the missing observation makes a big difference. The mechanism for dealing with 
missing data will not be used further in the rest of the paper. 



2.5 Numerics 

Direct implementation of our algorithm only works for sequences of a couple of 
hundred observations long. For longer sequences, the underflowing or overflow- 
ing of float numbers should be properly dealt with. In computing the MAP esti- 
mate. This problem is easily solved by using log probability and thus transforms 
the products into sums. This does not solve our problem in forward recursion 
and some other steps of the algorithm because sums are mixed together with 
products in the recursion. 

In the following we explain our approach to solve this problem. Although 
wc suspect that many other researchers have used this technique, we cannot 
find it docume nted in the lite rature. A different strategy used for HMM, well 
documented in iRabinei ( 1989[ ). does not seem to be adaptable to our situation. 



We illustrate our approach with the computation of (jl]). The main idea 
is actually the same as in computing the MAP. In the computer memory, the 
probability I /c) is stored as its log lp{yi:i\k) := logp(yi:i|fc). Computation 
of logp(?/i:i|fc) by recursion Q should proceed with care. We have the recursion: 

P{yi:i\k) ^^a,^kPiyi:j\k ~ l)p{yj+l:i\l), 

where ay^ = {lZl)/{lr\)- 

Suppose the log of p{yi:j\k— 1) and p{yj+i;i\l) in the above have been stored 
in the memory, we might be tempted to write everything in the exponential from, 
in which case we have 

P{yi:i,k) = ^exp{lp{yi.,j\k - 1) + lp{yj+i:i\l) + \oga,jk} 
j 

But taking exponential directly does not work — the reason that we choose to 
store lp{yi:i\k) instead of p(jji.i\k) = exp{lp(yi.i\k)} in the first place is that 
computation of exp{lp(jji-i\k)} might lead to overflow or underflow. This is still 
true with exp{lp{yi-j\k — 1) -I- lp{yj+i;i\l) -\- logayj.}. Here is the trick to avoid 
this. Let c = maXj{/p(yi,j|A: - 1) + lp{y]+i:i\l) + logOi^fe}, so 

Pih^) = ^exp{lp{yi.,j\k-l) + lp{yj+i:i\l) + \oga^jk} 
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= C^exp{lp{yi.,j\k - 1) + lp{yj+i..i\l) + logayfe - c})exp{c} 
j 

Now the numbers in the exponent lp{yi-j\k — 1) + + logaj^-fc — c 

is nonpositive, with the largest among them exactly 0. Although after taking 
exponential, there might be some positive terms that will become due to 
underflow, we can be sure that these terms do not have significant contributions 
to the sum, because the sum is dominated by terms that are close to 1. In this 
way we separate the terms that are the dominating ones from the others, and 
these terms are computed with high precision. The probability p(yi-i\k) will 
finally be stored in memory as 

log{y^ exp{lp{yi.,j\k - 1) + lp{yj+i;i\l) + loga^fe - c}) + c 



2.6 Markov dependence 



Our basic model in Section 12711 assumes that each segment has its own mean and 
variance, and observations within one segment are i.i.d. given these hidden pa- 
rameters. In this subsection, we will present an extension that takes into account 
higher order Markov dependency within one segment. Here we only consider 
one particular Markov model that is called autoregressive model which has been 
studied a lot in parametric time series theory. A good i ntroductory book on 
the cl assical theory for mathemati cally inclined readers is Brockwell and David 



([1987|), while the more recent book lFan a nd Yao (2003) focuses more on the non- 
parametric aspect of the theory. An autoregressive model of order p, AR{p), is 
defined as 

yt = Piyt-1 H h Ppyt-p + tt (10) 

where in general tt is a white noise process M^7V(0,(72), i.e. 

E{et) = 0, Var{et) = cr^ and Cow(ef , e^) = 0, for aU t ^ s 

We assume the more strict condition that et are i.i. d. iV(0,cr2). By the above 
definition, the i.i.d. model we studied before can be considered as the special 
case with order p equal to zero. Next we illustrate the model with p — 1 : 
yt = Pyt-i ARip) with p > 1 can be studied with more complicated vector 
algebra using a multidimensional normal prior. We also assume that the time 
series has mean zero for simplicity. 

For technical reasons, in a AR{1) model, we usually assume that |/3| < 1, 
which ensures that model (fTO|) admits a unique stationary and causal solution. 
This constraint is not going to be put on our model, since we will later assume 
a Gaussian prior for /5, which has infinite support. 

The model is same as before, except now we assume that for a segment [i, j], 
given the hidden parameters (3 and cr^ , the observations are generated by 

yi = Ei 
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Dt = /3yf_i + et,t = i + 1, . . . , J 
et iV(0,a2),i = z,...,j 

We put the following prior on the hidden parameters, same as that used in the 
basic model 

2 
fco 

(t'^\A - Inv ~ x^('^o,(To) 

The probability p{yi;j\\) can be computed analytically like ([3]). 

From the above discussion, it is evident that our model has great flexibility in 
modeling the dependency using an embedded Bayesian time series model. The 
only change we need to make is in computing Thus if the program 

is written in separate modules, the programming burden when we change to 
higher order Markov model is minimal, with only one module to be changed. 
No other parts of the code need to be modified at all. The model taking into 
account Markov dependency will not be pursued further in this paper. 

2.7 Repeated observations 

In the above, we apply the segmentation algorithm to one sequence at a time, 
excluding the EM stage where we may use multiple sequences to estimate the 
hyperparameters. The extension to multiple sequences is motivated by the ap- 
plication to tiling arrays where the same experiment is performed using multiple 
slides. The extension to the case where we have multiple sequences with the 
same underlying changepoints is straightforward. First we obtain the hyperpa- 
rameters as before (using EM if training data is available, or using the default 
choice © separately for each sequence). Thus these hyperparameters are fitted 
independently for each sequence. The variations between the sequences are re- 
flected in the differences of those hyperparameters. This can be regarded as a 
normalization method for different arrays. If the arrays have been normalized 
beforehand, we can use only one set of hyperparameters. Generally, for m repli- 
cas, we will have m sets of hyperparameters {O^'^^j^i. Assuming independence 
of observations between the replicas, the probability of observation can be fac- 
tored as p({yi:j^}^i|l) = P(yi:j^ |1: Q^'^'')- The recursions and samphng is 
same as before, replacing p(2/i:j|l) with p({yj-.'^^}"'|l). 

3 Examples 

We first demonstrate the effectiveness of the algorithm in detecting a single 
changepoint. The observations are generated as follows: 

"^''•iV(0,a2),i = l,2,...200, 
y, • N{pL, a^), i = 201, . . . , 400, ^ > 
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fi = 0.2 


/i = 0.5 


/i = 1 ^ — 2 


mean number of times sampled 








segmentation contains 2 segments 


44.6 


54.6 


73.2 99.3 


median distance of detected 








changepoint from truth 


58 


8.5 


0.6 0.1 



Table 1: Number of times sampled segmentation has 2 segments from 100 pos- 
terior samples, averaged over 100 sequences, together with the error of the 
changepoint location averaged over all samples with 2 segments. 



First, we set — 1 and vary /i, smaller fi makes the changepoint harder to 
detect. For this simulation, we set the hyperparameters as in ([9]). We gener- 
ate 100 sequences and draw 100 segmentation samples from the posterior for 
each sequence and count the number of times the algorithm correctly detect 
the changepoint. The mean number of times that the posterior sample con- 
tains 2 segments averaged over 100 sequences is reported in Table [TJ For every 
posterior sample that has 2 segments, we also calculated the distance of the 
detected changepoint from the true changepoint. From the table, we see that 
the changepoint model performs reasonably well with /i = 1 which is same as 
the standard deviation of the noise. 

Next we study multiple changepoint problems and the MCEM algorithm 
for estimating the hyperparameters. We generate 200 sequences with exactly 
5 changepoints, where the observations are generated from our Bayesian model 
stated in Section 12.11 That is, for each sequence, the changepoints are gen- 
erated from the uniform distribution as in ([T]), and the mean and variance 
for each segment is generated from the conjugate prior with hyperparameters 
fiQ = 0, fco = 0.5, vq = 5, ctq = 0.1. We compare two ways for hyperparameter 
estimation. For estimation using EM, we use the first 100 sequences for train- 
ing and the rest 100 sequence for testing the performance of the segmentation 
algorithm. For estimation using default choice ([9]) of the hyperparameter, we 
only use the second set of 100 sequences. The MAP estimate of the number of 
segments for those 100 sequences are shown in Figure [2j Both plots give reason- 
ably good estimate, although the results with estimated hyperparameters seems 
to be slightly better. 

Fin ally, we apply our model to the public CGH array data from lSniiders et al.l 



(|200lh . CGH arrays are used for the purpose of detecting changes in the number 
of DNA copies in the samples associated with cancer activity. DNA from a tu- 
mor sample and normal reference sample are labeled with different fluorophores 
and hybridized to the array. The ratio of the intensity of the tumor to that of 
the reference DNA is used to measure the copy number changes for a particular 
location in the genome. We choose to use chromosome 1 through 5 from the 
cell line gml3330. The missing observations are ignored and we take the ob- 
servations as equally spaced along the genome. The data from chromosomes 1 
through 5 are concatenated and treated as one single sequence, which is plotted 
in Figure [31 For this problem, we specify the hyperparameters with the default 
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number of segments 



number of segments 



(a) 



(b) 



Figure 2: Number of segments estimated as the posterior mode using hyper- 
parameters estimated using MCEM (a) or default choice (b), for 100 simulated 
sequences. 



choice ^ . The segmentation algorithm correctly identifies the visually obvious 
amplification at position 100 and deletion at around 430. But it does not pick 
out the single amplified observation at around 200. 



4 Discussion 

We use the simplest prior for the changepoints in this paper. More complicated 
priors can be specified for the segmentation. For example, we can use a Gamma 
distribution for the length of each segment (so the distribution on the number 
of changepoints is implicitly specified) , which is a popular choice in the litera- 
ture of generalized hidden Markov models (i.e. H MM with e x plicit durations), 
in application domains such as speech processing (iLevinsonI (Il986h ). In most 



applications the main target of inference is the segmentation (i.e. the location 
of the changepoints) as well as the segmental parameters such as the segmental 
means, both can easily be obtained from our model. If one considers change- 
point model as a tool for function approximation, piece- wise linear model seems 
to be more appropriate compared to the piece-wise constant model considered 
here. 

In future work, we hope to apply the model on biological experiment with 
multiple arrays as explained in section 12.71 when replicated data become avail- 
able. We are currently also investigating combining the hierarchical Bayesian 
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100 200 300 400 

Index 



Figure 3: CGH array data. 

model with HMM so tliat cacli segment is assigned a state, which might be useful 
for some biological applications such as searching for protein binding sites. 
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